Dataset Name: Monthly and Annual Energy Consumption by Sector
This dataset includes monthly data for energy production and consumption by different sectors depending on the sources, the origins, and the usages since 1973, and annual data since 1949 for the United States.This dataset is collected and published by the US Energy Information Administration on a monthly basis, where the organization release preliminary data for the new month, often estimated, and revise previous numbers.
The data is available both in csv files, xls files, and pdf ready to be processed. For this project, we have downloaded 19 excels files and did initial analysis on the missing data in each of this excel files. Then we have combined the data from all the excel files into a single excel file for easy processing and analysis of the data.
The original excel dataset contained metadata information like source, titles, etc. The top 9 rows of the excel contains metadata information which distorts the structure of the data. Therefore these metadata information is filtered out while importing the data into the R environment for easy data analysis.
The datatype of columns from column 2 to end is changed to numeric datatype. The datatype of column 1 is date.
Each columns starting from Total Fossil Fuels Production shows energy production in Quadrillion Btu. 1 quad (short for quadrillion Btu) is 10^15 Btu, which is about 1 exajoule (1.055×10^18 J). Quads are used in the United States for representing the annual energy consumption of large economies: for example, the U.S. economy used 99.75 quads in 2005. One quad/year is about 33.43 gigawatts. (Source:British thermal unit)
The dataset is summarized to get the amount of missing values in each column. As per the above table, there is no Null values in any of the columns of the dataset.
Code
vis_miss(energy_df_monthly) +theme(axis.text.x =element_text(angle =270, hjust =1),axis.title =element_text(size =10),plot.title =element_text(hjust =0.5, face ='bold', color ='darkblue'),plot.subtitle =element_text(hjust =0.5, color ='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Evergy Overview Data (Monthly)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust =0.5, color ='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Evergy Overview Data (Annually)",x ="Column Variables",y ="Number of Observations" )
The above visualization shows each column variable has 0% missing data. There is no null or missing value in any of the columns of the dataset.
Code
gg_miss_var(energy_df_monthly) +theme(plot.title =element_text(hjust =0.5, color ="darkblue", face ='bold'),plot.subtitle =element_text(hjust =0.5, color ='darkblue') ) +labs(title ="Missing Values by Column Variables",subtitle ="Table: Evergy Overview Data (Monthly)",x ="Column Variables",y ="Number of Missing Values" )
Code
gg_miss_var(energy_df_annual) +theme(plot.title =element_text(face='bold', color='darkblue'),plot.subtitle =element_text(hjust =0.5, color ='darkblue') ) +labs(title ="Missing Values by Column Variables (Annual Energy Data)",subtitle ="Table: Evergy Overview Data (Annually)",x ="Column Variables",y ="Number of Missing Values" )
The above visualization is a another representation of data missingness. The above datapoints are all present at 0 mark. It shows none of the columns in the dataset has missing value.
vis_miss(energy_df_monthly_pe_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Production By Source (Monthly Data)",x ="Column Variables",y ="% of Observations" )
Code
vis_miss(energy_df_annual_pe_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Production By Source (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data :
“Solar Energy Production” has 21% of its data missing.
“Wind Energy Production” has 19% of its data missing.
Overall, 2.9% of the total data is missing, while 97.1% is present.
Annual Data :
Most columns have no missing data (0% missing), including “Annual Total,” “Coal Production,” “Natural Gas (Dry) Production,” “Crude Oil Production,” and others.
“Geothermal Energy Production” has 15% of its data missing. - “Solar Energy Production” has the highest percentage of missing data at 47%.
“Wind Energy Production” also has a significant amount of missing data at 45%.
Overall, 7.6% of the total data is missing, while 92.4% is present.
vis_miss(energy_df_monthly_pe_consumption_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Consumption By Source (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_pe_consumption_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Consumption By Source (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data :
Wind Energy Consumption has the highest percentage of missing data points at 19%. This suggests that wind energy consumption data may be less reliable or less frequently collected compared to other energy sources.
Solar Energy Consumption also has a high percentage of missing data points at 21%. This could be due to similar reasons as wind energy, or potentially due to the more recent and rapid growth of solar energy compared to other sources.
Annual Data :
The column with the highest percentage of missing data points is “Wind Energy Consumption,” with 45% of its data missing. This is followed by “Solar Energy Consumption” with 47% missing data.
Most columns have no missing data points. These include columns like “Consumption,” “Total Renewable Energy Consumption,” and “Total Fossil Fuels Consumption.”
vis_miss(energy_df_monthly_pe_exports_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Exports By Source (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_pe_exports_source) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Primary Energy Exports By Source (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
Approximately 5.4% of data points are missing across all columns. This means that out of all the data points in the dataset, 5.4% are not available.
The majority of columns have zero levels of missing data (0%). This indicates that for most variables, the data is mostly complete. There is one column, “Biomass Exports,” with a significantly higher proportion of missing data (54%). This suggests that this particular variable has the most incomplete data within the dataset.
Annual Data:
Approximately 8.9% of data points are missing across all columns. This means that out of all the data points in the dataset, 8.9% are not available.
The majority of columns have zero levels of missing data (0%). This indicates that for most variables, the data is mostly complete. There is one column, “Electricity Exports,” with a significantly higher proportion of missing data (69%). This suggests that this particular variable has the most incomplete data within the dataset.
2.3.6 Energy Consumption by Residentaial, Commercial and Industrial Sectors Data
vis_miss(energy_df_monthly_consumption_sector_1) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Energy Consumption By Residential,Commercial and Industrial Sectors (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_consumption_sector_1) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Energy Consumption By Residential,Commercial and Industrial Sectors (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Both of the charts show that all columns in the table have 0% missing datapoints. This means that there are no missing values in any of the columns. Having no missing data is generally good because it allows for more accurate and reliable analysis. It means that all the data points are present and can be used for calculations and modeling without any gaps.
2.3.7 Energy Consumption By Transportaion, End and Electric Power Sectors Data
vis_miss(energy_df_monthly_consumption_sector_1) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Energy Consumption By Transportation, Total End Use and Electric Power Sector (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_consumption_sector_1) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Energy Consumption By Transportation, Total End Use and Electric Power Sector (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Both of the charts show that all columns in the table have 0% missing datapoints. This means that there are no missing values in any of the columns. Having no missing data is generally good because it allows for more accurate and reliable analysis. It means that all the data points are present and can be used for calculations and modeling without any gaps.
vis_miss(energy_df_monthly_petroleum) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum Overview Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_petroleum) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum OVerview Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
“Biofuels Plant Net Production” has the highest percentage of missing data points. There are 69% and 80% of data missing in each of the monthly and annual dataset.
2.3.9 Petroleum Consumption By Residentail and Commercial Sectors Data
vis_miss(energy_df_monthly_petroleum_consumption_res_comm) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum Consumption By Residential and Commercial Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_petroleum_consumption_res_comm) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum Consumption By Residential and Commercial Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
The chart highlights that the majority of the data points in the dataset are present, with only one column (“Petroleum Coke Consumed by the Commercial Sector”) having a significant amount of missing data (31%).
Annual Data:
The chart highlights that the majority of the data points in the dataset are present, with only one column (“Petroleum Coke Consumed by the Commercial Sector”) having a significant amount of missing data (53%).
2.3.10 Petroleum Consumption by Industrial Sector Data
vis_miss(energy_df_monthly_petroleum_consumption_trans) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum Consumption By Transportation and Electric Power Sector Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_petroleum_consumption_trans) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Petroleum Consumption By Transportation and Electric Power Sector Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
High Percentage of Missing Data in “Other Products Consumed by the Transportation Sector”. This column has the highest percentage of missing data points, with 93% of the data missing. This could be due to various reasons, such as data collection limitations or the nature of the data itself.
Annual Data:
A few columns stand out with higher percentages of missing data. These include:
“Petroleum Coke Consumed by the Electric Power Sector” (28% missing)
“Other Products Consumed by the Transportation Sector” (96% missing)
“Jet Fuel Consumed by the Transportation Sector” (4% missing)
vis_miss(energy_df_monthly_natural_gas) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Natural Gas Overview Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_natural_gas) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Natural Gas Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
In the monthly data, there are 3 columns with missing data as per the graph. The missing percentage is 14%, 14% and 5% respectively.
Annual Data:
Most columns have no missing data: The majority of the columns have 0% missing data, represented by the full gray bars. Supplemental Gaseous Fuels has the highest missing data has 41% missing data, represented by the tall black bar. All other columns have 0% missing data: The remaining columns have no missing data.
vis_miss(energy_df_monthly_natural_gas_consumption) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Natural Gas Consumption By Sector Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_natural_gas_consumption) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Natural Gas Consumption By Sector Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
The dataset has a total of 600 observations. There is a small percentage of missing data across the columns.
The dataset has a relatively high percentage of missing data in two specific columns: “Natural Gas Consumed by the Transportation Sector, Vehicle Fuel” and “Natural Gas Consumed by the Other Industrial Sector, CHP.”
Annual Data:
The dataset has a total of 15 columns. 8.5% of the data points across all columns are missing. 91.5% of the data points are present.
Most columns have no missing data (0%). The column “Natural Gas Consumed by the Transportation Sector, Vehicle Fuel” has the highest percentage of missing data, with 55% of its values missing. The column “Natural Gas Consumed by the Other Industrial Sector, CHP” has 53% of its values missing.
vis_miss(energy_df_monthly_coal_cons) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Coal Consumption By Sector Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_coal_cons) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Coal Consumption By Sector Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
The dataset has 7.2% of data points missing across all columns. The most missing data is in the “Coal Consumed by the Residential Sector” column (32%), followed by “Coal Consumed by the Commercial Sector, CHP” and “Coal Consumed by the Other Industrial Sector, CHP” (both at 31%).
Annual Data:
The dataset has 9.8% of data points missing across all columns. The most missing data is in the “Coal Consumed by the Residential Sector” column (21%), followed by “Coal Consumed by the Commercial Sector, CHP” and “Coal Consumed by the Other Industrial Sector, CHP” (both at 53%).
vis_miss(energy_df_monthly_price) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Avergar Electricity Price Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_price) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: Average Electricity Price Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
Monthly Data:
The majority of the columns have some missing data, with the exception of the “Month” column.
The column “Average Price of Electricity to Ultimate Customers, Other” has the highest percentage of missing data (73%).
The columns “Average Price of Electricity to Ultimate Customers, Transportation” and “Average Price of Electricity to Ultimate Customers, Residential,” “Average Price of Electricity to Ultimate Customers, Commercial,” and “Average Price of Electricity to Ultimate Customers, Industrial” have 50%, 13%, 13%, and 13% missing data, respectively.
Annual Data:
The column “Average Price of Electricity to Ultimate Customers, Transportation” has the highest percentage of missing data (67%).
The column “Average Price of Electricity to Ultimate Customers, Other” has 33% missing data.
2.3.18 Renewable Energy Production and Consumption by Sectors Data
vis_miss(energy_df_monthly_co2) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: CO2 Emission Data (Monthly Data)",x ="Column Variables",y ="Number of Observations" )
Code
vis_miss(energy_df_annual_co2) +theme(axis.text.x =element_text(angle=270, hjust=1),axis.title =element_text(size=10),plot.title =element_text(hjust=0.5, face='bold', color='darkblue'),plot.subtitle =element_text(hjust=0.5, color='darkblue') ) +labs(title ="Percentage of datapoints missing in each column",subtitle ="Table: CO2 Emission Data (Annual Data)",x ="Column Variables",y ="Number of Observations" )
There is no missing data in any of the dataset as per the graph.
2.4 Combining Data
Managing 19 separate Excel files for data analysis can be a challenging and time-consuming process, as it requires handling multiple file operations, ensuring consistency across datasets, and potentially dealing with overlapping or missing data. A more efficient approach is to consolidate all the data into a single table structure. By doing so, we eliminate the need for complex and computationally expensive join operations, simplify the workflow, and significantly reduce the amount of code required to process and analyze the data. This approach not only streamlines the data analysis process but also minimizes errors and improves overall efficiency.